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Abstract 

The LASSO-Patternsearch algorithm is proposed as a two-step method to identify 
clusters or patterns of multiple risk factors for outcomes of interest in demographic and 
genomic studies. The predictor variables are dichotomous or can be coded as dichoto- 
mous. Many diseases are suspected of having multiple interacting risk factors acting in 
concert, and it is of much interest to uncover higher order interactions or risk patterns 
when they exist. The patterns considered here are those that arise naturally from the 
log linear expansion of the multivariate Bernoulli density. The method is designed for 
the case where there is a possibly very large number of candidate patterns but it is be- 
lieved that only a relatively small number are important. A LASSO is used to greatly 
reduce the number of candidate patterns, using a novel computational algorithm that 
can handle an extremely large number of unknowns simultaneously. Then the patterns 
surviving the LASSO are further pruned in the framework of (parametric) generalized 
linear models. A novel tuning procedure based on the GACV for Bernoulli outcomes, 
modified to act as a model selector, is used at both steps. We first applied the method 
to myopia data from the population-based Beaver Dam Eye Study, exposing physio- 
logically interesting interacting risk factors. In particular, we found that for an older 
cohort the risk of myopic changes in refraction for smokers is reduced by taking vita- 
mins while the risk for non-smokers is independent of the "taking vitamins" variable. 
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This is in agreement with the general result that smoking reduces the absorption of 
vitamins, and certain vitamins have been associated with eye health. We then applied 
the method to data from a generative model of Rheumatoid Arthritis based on Prob- 
lem 3 from the Genetic Analysis Workshop 15, successfully demonstrating its potential 
to efficiently recover higher order patterns from attribute vectors of length typical of 
genomic studies. 

AMS 2000 Subject Classifications: Primary: 62G05, 62G08, 90C25; Secondary: 92D30, 
92D20. Keywords and phrases: LASSO, GACV, BGACV, variable and pattern selec- 
tion, myopic change, SNP selection. 

1 Introduction 

We consider the problem which occurs in demographic and genomic studies when there are a 
large number of risk factors that potentially interact in complicated ways to induce elevated 
risk. The goal is to search for important patterns of multiple risk factors among a very large 
number of candidate patterns, with results that are easily interpretable. In this work the 
LASSO-Patternsearch algorithm (LPS) is proposed for this task. All variables are binary, 
or have been dichotomized before the analysis, at the risk of some loss of information; this 
allows the study of much higher order interactions than would be possible with risk factors 
with more than several possible values, or with continuous risk factors. Thus LPS may, if 
desired, be used as a preprocessor to select clusters of variables that are later analyzed in their 
pre-dichotomized form, see [38]. Along with demographic studies, a particularly promising 
application of LPS is to the analysis of patterns or clusters of SNPs (Single Nucleotide 
Polymorphisms) or other genetic variables that are associated with a particular phenotype, 
when the attribute vectors are very large and there exists a very large number of candidate 
patterns. LPS is designed specifically for the situation where the number of candidate 
patterns may be very large, but the solution, which may contain high order patterns, is 
believed to be sparse. LPS is based on the log linear parametrization of the multivariate 
Bernoulli distribution [35] to generate all possible patterns, if feasible, or at least a large 
subset of all possible patterns up to some maximum order. LPS begins with a LASSO 
algorithm (penalized Bernoulli likelihood with an li penalty), used with a new tuning score, 
BGACV. BGACV is a modified version of the GACV score [SB] to target variable selection, 
as opposed to Kullback-Liebler distance, which is the GACV target. A novel numerical 
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algorithm is developed specifically for this step, which can handle an extremely large number 
of basis functions (patterns) simultaneously. This is in particular contrast to most of the 
literature in the area, which uses greedy or sequential algorithms. The patterns surviving 
this process are then entered into a parametric linear logistic regression to obtain the final 
model, where further sparsity may be enforced via a backward elimination process using the 
BGACV score as a stopping criterion. Properties of LPS will be examined via simulation, 
and in demographic data by scrambling responses to establish false pattern generation rates. 

There are many approaches that can model data with binary covariates and binary re- 
sponses, see, for example CART PQ, LOTUS [2], Logic regression [27] and Stepwise Penalized 
Logistic Regression (SPLR) [25]. Logic regression is an adaptive regression methodology 
that constructs predictors as Boolean combinations of binary covariates. It uses simulated 
annealing to search through the high dimensional covariate space and uses five-fold cross 
validation and randomization based hypothesis testing to choose the best model size. SPLR 
is a variant of logistic regression with l 2 penalty to fit interaction models. It uses a forward 
stepwise procedure to search through the high dimensional covariate space. The model size 
is chosen by an AIC- or BIC-like score and the smoothing parameter is chosen by 5-fold 
cross validation. For Gaussian data the LASSO was proposed in [32] as a variant of linear 
least squares ridge regression with many predictor variables. As proposed there, the LASSO 
minimized the residual sum of squares subject to a constraint that the sum of absolute 
values of the coefficients of the basis functions be less than some constant, say t. This is 
equivalent to minimizing the residual sum of squares plus a penalty which is some multiple A 
(depending on t) of the sum of absolute values (Zi penalty). It was demonstrated there that 
this approach tended to set many of the coefficients to zero, resulting in a sparse model, a 
property not generally obtaining with quadratic penalties. A similar idea was exploited in 
[3] to select a good subset of an overcomplete set of nonorthogonal wavelet basis functions. 
The asymptotic behavior of LASSO type estimators was studied in [IB], and [23] discussed 
computational procedures in the Gaussian context. More recently [5] discussed variants of 
the LASSO and methods for computing the LASSO for a continuous range of values of A in 
the Gaussian case. Variable selection properties of the LASSO were examined in [20J in some 
special cases, and many applications can be found on the web. In the context of nonparamet- 
ric ANOVA decompositions [38] used an overcomplete set of basis functions obtained from 
a Smoothing Spline ANOVA model, and used i\ penalties on the coefficients of main effects 
and low order interaction terms, in the spirit of [3] . The present paper uses some ideas from 
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[38] , although the basis function set here is quite different. Other work has implemented l\ 
penalties along with quadratic (reproducing kernel square norm) penalties to take advantage 
of the properties of both kinds of penalties, see for example [12] [19] [37] [39] . 

The rest of the article is organized as follows. In Section 2 we describe the first (LASSO) 
step of the LPS including choosing the smoothing parameter by the B-type Generalized 
Approximate Cross Validation (BGACV), "B" standing for the prior belief that the solu- 
tion is sparse, analogous to BIC. An efficient algorithm for the LASSO step is presented 
here. Section 3 describes the second step of the LASSO-Patternsearch algorithm, utilizing 
a parametric logistic regression, again tuned by BGACV. Section 4 presents three simula- 
tion examples, designed to demonstrate the properties of LPS as well as comparing LPS to 
Logic regression and SPLR. Favorable properties of LPS are exhibited in models with high 
order patterns and correlated attributes. Section 5 applies the method to myopic changes in 
refraction in an older cohort from the Beaver Dam Eye Study [15], where some interesting 
risk patterns including one involving smoking and vitamins are found. Section 6 applies the 
method to data from a generative model of Rheumatoid Arthritis Single Nucleotide Poly- 
morphisms adapted from the Genetics Analysis Workshop 15 [6], which examines the ability 
of the algorithm to recover third order patterns from extremely large attribute vectors. Sec- 
tion 7 notes some generalizations, and, finally, Section 8 gives a summary and conclusions. 
Appendix A derives the BGACV score; Appendix B gives details of the specially designed 
code for the LASSO which is capable of handling a very large number of patterns simulta- 
neously; Appendix C shows the detailed results of Simulation Example 3. When all of the 
variables are coded as 1 in the risky direction, the model will be sparsest among equivalent 
models. Appendix D gives a lemma describing what happens when some of the variables are 
coded with the opposite direction as 1. 

2 The LASSO-Patternsearch Algorithm 

2.1 The LASSO-Patternsearch Algorithm - Step 1 

Considering n subjects, for which p variables are observed, we first reduce continuous vari- 
ables to "high" or "low" in order to be able to examine very many variables and their 
interactions simultaneously. We will assume that for all or most of the the p variables, we 
know in which direction they are likely to affect the outcome or outcomes of interest, if at all. 
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For some variables, for example smoking, it is clear for most endpoints in which direction 
the smoking variable is likely to be "bad" if it has any effect, and this is true of many but 
not all variables. For some continuous variables, for example systolic blood pressure, higher 
is generally "worse" , but extremely low can also be "bad" . For continuous variables, we need 
to initially assume the location of a cut point on one side of which the variable is believed 
to be "risky" ("high") and the other side "not risky" ("low"). For systolic blood pressure 
that might, for example, be 140 mmHg. For an economic variable, that might be something 
related to the current standard for poverty level. If the "risky" direction is known for most 
variables the results will be readily interpretable. Each subject thus has an attribute vector 
of p zeroes and ones, describing whether each of their p attributes is on one side or the other 
of the cutoff point. The LASSO-Patternsearch approach described below is able to deal with 
high order interactions and very large p. The data is x(i), i — 1, • ■ ■ , n}, where yi £ {0, 1} 
codes the response, x(i) = (x±(i), X2{i), • • • ,x p (i)) is the attribute vector for the ith subject, 
Xj(i) £ {0,1}. Define the basis functions Bj 1 j 2 j r (x) = Y\j_ 1 Xj t , that is, Bj 1 j 2 j r (x) = 1 if 
all l's and otherwise. We will call B jlh .. jr (x) an rth order pattern. Let q 
be the highest order we consider. Then there will be Nb = J2l=o (^) patterns. If q = p, we 
have a complete set of Nb = 2 P such patterns (including the constant function /x), spanning 
all possible patterns. If q = 1 only first order patterns (henceforth called "main effects") 
are considered, if q = 2 main effects and second order patterns are considered, and so forth. 
Letting p(x) = Prob[y = l\x] and the logit (log odds ratio) be f(x) = log\p(x)/(l — p(x))], 
we estimate / by minimizing 

I x (yJ) = C(yJ) + XJ(f) (1) 

where C(y, f) is - times the negative log likelihood: 

1 " 

/) = - Eh/i/WO) + Mi + e /(x(l)) )] (2) 
n i=i 

with 

n b -i 

f{x) = fi+ E c e B e (x), (3) 
i=\ 

where we are relabeling the iV^ — 1 (non-constant) patterns from 1 to Nb — 1, and 

N B -1 

J(f) = E H (4) 
i=i 
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If all possible patterns are included in ([3]) then / there is the most general form of the log 
odds ratio for y given x obtainable from the log linear parametrization of the multivariate 
Bernoulli distribution given in [35]. In Step 1 of the LASSO-Patternsearch we minimize ([I]) 
using the BGACV score to choose A. The next section describes the BGACV score and the 
kinds of results it can be expected to produce. 

2.2 B-type Generalized Approximate Cross Validation (BGACV) 

The tuning parameter A in ([1]) balances the trade-off between data fitting and the sparsity of 
the model. The bigger A is, the sparser the model. The choice of A is generally a crucial part 
of penalized likelihood methods and machine learning techniques like the Support Vector 
Machine. For smoothing spline models with Gaussian data, [M] proposed ordinary leave- 
out-one cross validation (OCV) . Generalized Cross Validation (GCV), derived from OCV, 
was proposed in |1][H], and theoretical properties were obtained in [21] and elsewhere. For 
smoothing spline models with Bernoulli data and quadratic penalty functionals, [36] derived 
the Generalized Approximate Cross Validation (GACV) from an OCV estimate following 
the method used to obtain GCV. In [38] GACV was extended to the case of Bernoulli data 
with continuous covariates and l\ penalties. 

The derivation of the GACV begins with a leaving-out-one likelihood to minimize an 
estimate of the comparative Kullback-Leibler distance (CKL) between the true and estimated 
model distributions. The ordinary leave-out-one cross validation score for CKL is 

1 n 

CV{\) = - J2l-yJ [ r ] (^)) + log(l + e^W')], (5) 

n i=l 

where fx is the minimizer of the objective function (pQ), and f[ is the minimizer of (0Q) with 
the zth data point left out. Through a series of approximations and an averaging step as 
described in Appendix A, we obtain the GACV score appropriate to the present context. It 
is a simple to compute special case of the GACV score in [38] : 

GACV(X) = - Yl-Vifxi + log(l + e^)] + -trff ^ 1 Viiyi ~f u) , (6) 
n i^i n [n- N Bo ) 

here H = B* (B*'W B*)~ x B*' , where W is the n x n diagonal matrix with iith element the 
estimated variance at x(i) {pi\{l — Pi\)) and B* is the n x Nb design matrix for the A^ 
non-zero q in the model. The quantity trH J^-Nb ) — ~ P^ a y s the role of degrees of free- 
dom here. As is clear from the preceding discussion, the GACV is a criterion whose target 
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is the minimization of the (comparative) Kullback-Liebler distance from the estimate to 
the unknown "true" model. By analogy with the Gaussian case (where the predictive mean 
squared error and comparative Kullback-Liebler distance coincide), it is known that optimiz- 
ing for predictive mean square error and optimizing for model selection when the true model 
is sparse are not in general the same thing. This is discussed in various places, for example, 
see [TO] which discusses the relation between AIC and BIC, AIC being a predictive criterion 
and BIC, which generally results in a sparser model, being a model selection criterion, with 
desirable properties when the "true" model is of fixed (low) dimension as the sample size 
gets large. See also [13] , [20] and particularly our remarks at the end of Appendix |A] In the 
AIC to BIC transformation, if 7 is the degrees of freedom for the model, then BIC replaces 
7 with (-log 71)7. By analogy we obtain a model selection criterion, BGACV, from GACV 
as follows. Letting 7 be the quantity playing the role of degrees of freedom for signal in the 
Bernoulli-/i penalty case, 

(n - N Bo ) 

7 is replaced by (| log 71)7 to obtain 

BGACV(X) = I ±[-y ihi + log(l + e>»)] + '-^trH^ *<* " P «> - 



nf^ n 2 (n-N Bo ) 

We illustrate the difference of empirical performances between GACV and BGACV on 
a "true" model with a small number of strong patterns. Let (AJ\ A|) T , (A^Ag) 7 and 
(A3, Ag) T be independently distributed from a bivariate normal distribution with mean 0, 
variance 1 and covariance 0.7. A« = 1 if X* > and otherwise, i = 1,2, •■-,6. A7 
is independent of the others and takes two values {1,0}, each with a probability of 0.5. 
A = (A 1; • • ■ , A 7 ). The sample size n = 800. Three patterns that consist of six variables are 
important, and A 7 is noise. The true logit is 

f(x) = -2 + 1.55i(x) + 1.5B 23 (x) + 2B i56 (x). (9) 

This problem is very small so we chose the maximum order q = p = 7, so 127 patterns plus 
a constant term are entered in the trial model. We ran the simulation 100 times and the 
result is shown in Table [H Both GACV and BGACV select the three important patterns 
perfectly, but GACV selects more noise patterns than BGACV. Note that a total of 2 7 — 4 
noise patterns have been considered in each run. The maximum possible number in the 
last column is 100 x (2 7 - 4) = 12,400. Neither GACV or BGACV is doing a bad job 
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but we will discuss a method to further reduce the number of noise patterns in Section 3. 
Figure [T] shows the scores of these two criteria in the first data set. BGACV selects a bigger 
smoothing parameter than GACV does. These scores are not continuous at the the point 
where a parameter becomes zero so we see jumps in the plots. 

Table 1: The results of Simulation Example 1. The second through fourth columns are the 
appearance frequencies of the three important patterns in the 100 runs. The last column 
is the total appearance frequency of all other patterns. The second and third row compare 
GACV and BGACV within the first step of LPS. The fourth through sixth rows, which 
compare the full LPS with Logic regression and SPLR will be discussed in Section 14.11 



Method 


B 1 


E>23 


-B456 


other 


GACV 


100 


100 


100 


749 


BGACV 


100 


100 


100 


568 


LPS 


97 


96 


98 


34 


Logic 


100 


94 


94 


64 


SPLR 


100 


90 


55 


426 



2.3 Computation 

From a mathematical point of view, this optimization problem ([1]) is the same as the likeli- 
hood basis pursuit (LBP) algorithm in [38], but with different basis functions. The solution 
can easily be computed via a general constrained nonlinear minimization code such as MAT- 
LAB's f mincon on a desktop, for a range of values of A, provided n and Nb are not too large. 
However, for extremely large data sets with more than a few attributes p (and therefore a 
large number Nb of possible basis functions), the problem becomes much more difficult to 
solve computationally with general optimization software, and algorithms that exploit the 
structure of the problem are needed. We design an algorithm that uses gradient information 
for the likelihood term in ([T|) to find an estimate of the correct active set (that is, the set 
of components q that are zero at the minimizer). When there are not too many nonzero 
parameters, the algorithm also attempts a Newton-like enhancement to the search direction, 
making use of the fact that first and second partial derivatives of the function in ([I]) with 
respect to the coefficients q are easy to compute analytically once the function has been 
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-2.6 -2.5 -2.4 -2.3 -2.2 -2.1 -2 -1.9 -1.8 -1.7 -1.6 

logio(A) 



Figure 1: Comparison of BGACV and GACV in the first data set of Simulation Example 1. 
The solid dots are the minima. BGACV selects a bigger A than GACV does. 

evaluated at these values of q. It is not economical to compute the full Hessian (the matrix 
of second partial derivatives), so the algorithm computes only the second derivatives of the 
log likelihood function L with respect to those coefficients q that appear to be nonzero at 
the solution. For the problems that the LASSO-Patternsearch is designed to solve, just a 
small fraction of these Nb coefficients are nonzero at the solution. This approach is similar to 
the two-metric gradient projection approach for bound-constrained minimization, but avoids 
duplication of variables and allows certain other economies in the implementation. 

The algorithm is particularly well suited to solving the problem (pQ) for a number of 
different values of A in succession; the solution for one value of A provides an excellent 
starting point for the minimization with a nearby value of A. Further details of this approach 
can be found in Appendix [B] 

3 The LASSO-Patternsearch Algorithm - Step 2 

In Step 2 of LASSO-Patternsearch algorithm, the Nb patterns surviving Step 1 are entered 
into a linear logistic regression model using glmf it in MATLAB and pattern selection is 
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then carried out by the backward elimination method. We take out one of the Nb patterns 
at a time, fit the model with the remaining patterns and compute the tuning score. The 
pattern that gives the best tuning score to the model after being taken out is removed from 
the model. This process continues until there are no patterns in the model. A final model is 
chosen from the pattern set with the best tuning score. Note that all-subset selection is not 
being done, since this will introduce an overly large number of degrees of freedom into the 
process. 

If copious data is available, then a tuning set can be used to create the tuning score, 
but this is frequently not the case. Inspired by the tuning method in the LASSO step, we 
propose the BGACV score for the parametric logistic regression. The likelihood function is 
smooth with respect to the parameters so the robust assumption that appears in Appendix 
IA1 is not needed. Other than that, the derivation of the BGACV score for parametric 
logistic regression follows that in Appendix |A] Let s be the current subset of patterns under 
consideration and B s be the design matrix. The BGACV score for logistic regression is the 
same as © with H = B S (B' S W B^- 1 B' s . 

BGACV(s) = I ±[-yJ sl + log(l + e^)] + l^ tr ff g^izM ; (10 ) 
n i=1 n I [n-J\ Bo ) 

where f S i is the estimated log odds ratio for observation % and p S i is the corresponding 

probability. The BGACV scores are computed for each model that is considered in the 

backward elimination procedure, and the model with the smallest BGACV score is taken as 

the final model. 

The following is a summary of the LASSO-Patternsearch algorithm: 

1. Solve ([T]) and choose A by BGACV. Keep the patterns with nonzero coefficients. 

2. Put the patterns with nonzero coefficients from Step 1 into a logistic regression model 
and select models by the backward elimination method with the selection criterion 
being BGACV. 

For simulated data, the results can be compared with the simulated model. For observational 
data, a selective examination of the data will be used to validate the results. Other logistic 
regression codes, e. g. from R or SAS can be used instead of glmf it here. 
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4 Simulation Studies 



In this section we study the empirical performance of the LPS through three simulated 
examples. The first example continues with simulated data in Section 12.21 There are three 
pairs of correlated variables and one independent variable. Three patterns are related to 
the response. The second example has only one high order pattern. The correlation within 
variables in the pattern is high and the correlation between variables in the pattern and 
other variables varies. The last example studies the performance of our method under 
various correlation settings. We compare LPS with two other methods, Logic regression [27J 
and Stepwise Penalized Logistic Regression (SPLR) [26]. We use the R package LogicReg 
to run Logic regression and the R package stepPlr to run SPLR. The number of trees and 
number of leaves in Logic regression are selected by 5-fold cross validation. The smoothing 
parameter in SPLR is also selected by 5-fold cross validation, and then the model size is 
selected by a BIC-like score based on an approximation to a degrees of freedom reproduced 
in the Comments section of Appendix |A] 

4.1 Simulation Example 1 

In this example we have 7 variables and the sample size is 800. The true logit is f(x) = 
—2 + 1.5Bi(x) + 1.5B 2 3(x) + 2£> 456 (x). The distribution of the covariates was described in 
Section 12.21 We simulated 100 data sets according to this model and ran all three methods 
on these data sets. The results are shown in the last three rows of Table [TJ 

Let's compare LPS with the LASSO step (third row in Table Q]) first. LPS misses all 
three patterns a few times. However, these numbers are still very close to 100 and more 
importantly, LPS significantly reduced the number of noise patterns, from over 500 to 34. 
Here we see why a second step is needed after the LASSO step. Now let's look at LPS 
compared with the other two methods. Logic regression picks the first term perfectly but it 
doesn't do as well as LPS on the remaining two patterns. It also selects more noise patterns 
than LPS. SPLR does worse, especially on the last pattern. It is not surprising because this 
example is designed to be difficult for SPLR, which is a sequential method. In order for 
to be in the model, at least one main effect of X±, and X$ should enter the model first, 
say A 4 . And then a second order pattern should also enter before -8456- It could be -B45 or 
i?4 6 . However, none of these lower order patterns are in the true model. This makes it very 
hard for SPLR to consider -B456, and the fact that variables in the higher order pattern are 
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correlated with variables in the lower order patterns makes it even harder. We also notice 
that SPLR selects many more noise patterns than LPS and Logic regression. Because of the 
way it allows high order patterns to enter the model, the smallest SPLR model has 6 terms, 
Bi, one of B 2 and B 3 , B 2 s, one main effect and one second order pattern in B^q, and B^q. 
Conditioning on the appearance frequencies of the important patterns, SPLR has to select 
at least 90 + 2 x 55 = 200 noise patterns. The difference, 426 — 200 = 226 is still much bigger 
than 34 selected by LPS. 

4.2 Simulation Example 2 

We focus our attention on a high order pattern in this example. Let X{ through XI be 
generated from a normal distribution with mean 1 and variance 1. The correlation between 
any of these two is 0.7. X,i = 1 if X* > and otherwise for i = 1, • • ■ , 4. X i+4: = X, with 
probability p and Xj + 4 will be generated from Bernoulli(0.84) otherwise for % = 1, 
p takes values 0,0.2,0.5 and 0.7. X = (Xx,---,X 8 ). Note that P(X X = 1) = 0.84 in our 
simulation. The sample size is 2000 and the true logit is f(x) = —2 + 2Bi 234: (x). We consider 
all possible patterns, so q = p = 8. We also ran this example 100 times and the results are 
shown in Table [2j 

LPS does a very good job and it is very robust against increasing p, which governs the 
correlation between variables in the model and other variables. We selected the high order 
pattern almost perfectly and kept the noise patterns below 10 in all four settings. Logic 
regression selects the important pattern from 70 to 80 times and noise patterns over 130 
times. There is a mild trend that it does worse as the correlation goes up, but the last one is 
an exception. SPLR is robust against the correlation but it doesn't do very well. It selects 
the important pattern from 50 to 60 times and noise patterns over 500 times. From this 
example we can see that LPS is extremely powerful in selecting high order patterns. 

4.3 Simulation Example 3 

The previous two examples have a small number of variables so we considered patterns of 
all orders. To demonstrate the power of our algorithm, we add in more noise variables in 
this example. The setting is similar to Example 2. Let X^ through X\ be generated from a 
normal distribution with mean 1 and variance 1. The correlation between any of these two is 
pi and pi takes values in 0,0.2, 0.5 and 0.7. Xi = 1 if X* > and otherwise, i — 1,2, 3, 4. 
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Table 2: The results of simulation example 2. The numerators are the appearance frequencies 
of -B1234 and the denominators are the appearance frequencies of all noise patterns. 



p 





0.2 


0.5 


0.7 


LPS 


98/10 


100/9 


97/8 


98/5 


Logic 


82/132 


73/162 


72/237 


74/157 


SPLR 


53/602 


60/576 


57/581 


58/552 



Xj + 4 = Xi with probability p 2 and Aj + 4 will be generated from Bernoulli(0.84) otherwise 
for i — 1, 2, 3, 4. p 2 takes values 0, 0.2, 0.5 and 0.7 also. X 9 through X 2Q are generated from 
Bernoulli(0.5) independently. X = (X±, ■ ■ ■ ,X 20 ). The sample size n = 2000 and the true 
logit is 

f(x) = -2 + 2B 9 (x) + 2B 67 (x) + 2B 12U (x). 

Unlike the previous two examples, we consider patterns only up to the order of 4 because of 
the large number of variables. That gives us a total of ( 2 °) + (^f) + ■ ■ ■ + ( 2 4 °) = 6196 basis 
functions. 

Figure [2] shows the appearance frequencies of the high order pattern -B1234. From the 
left plot we see that LPS dominates the other two methods. All methods are actually very 
robust against pi, the correlation within the high order pattern. There is a huge gap between 
the two blue lines, which means SPLR is very sensitive to p 2 which governs the correlation 
between variables in the high order pattern and others. This is confirmed by the right plot, 
where the blue lines decrease sharply as p 2 increases. We see similar but milder behavior in 
Logic regression. This is quite natural because the problem becomes harder as the the noise 
variables become more correlated with the important variables. However, LPS handles this 
issue quite well, at least in the current setting. We see a small decrease in LPS as p 2 goes 
up but those numbers are still very close to 100. The performance of these methods on the 
second order pattern is generally similar but the trend is less obvious as a lower order 
pattern is easier for most methods. The main effect B$ is selected almost perfectly by every 
method in all settings. More detailed results are presented in Table [7] in Appendix O 
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Figure 2: Appearance frequency of the high order pattern -B1234 in simulation example 3. In 
the left panel, the x-axis is p\. P2 is 0.2 for the dashed line and 0.7 for the solid line. In 
the right panel, the x-axis is pi- pi is 0.2 for the dashed line and 0.7 for the solid line. The 
red triangles represent LPS, the blue diamonds represent Logic regression [27] and the green 
circles represent SPLR |26j. 
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5 The Beaver Dam Eye Study 



The Beaver Dam Eye Study (BDES) is an ongoing population-based study of age-related 
ocular disorders including cataract, age-related macular degeneration, visual impairment and 
refractive errors. Between 1987 and 1988, a private census identified 5924 people aged 43 
through 84 years in Beaver Dam, WI. 4926 of these people participated the baseline exam 
(BD I) between 1988 and 1990. Five (BD II), ten (BD III) and fifteen (BD IV) year follow-up 
data have been collected and there have been several hundred publications on this data. A 
detailed description of the study can be found in [T5] . 

Myopia, or nearsightedness, is one of the most prevalent world-wide eye conditions. My- 
opia occurs when the eyeball is slightly longer than usual from front to back for a given 
level of refractive power of the cornea and lens and people with myopia find it hard to see 
objects at a distance without a corrective lens. Approximately one-third of the population 
experience this eye problem and in some countries like Singapore, more than 70% of the 
population have myopia upon completing college [29]. It is believed that myopia is related 
to various environmental risk factors as well as genetic factors. Refraction is the continuous 
measure from which myopia is defined. Understanding how refraction changes over time 
can provide further insight into when myopia may develop. Five and ten- year changes of 
refraction for the BDES population were summarized in p2][l8]. We will study five-year 
myopic changes in refraction (hereinafter called "myopic change") in an older cohort aged 
60 through 69 years. We focus on a small age group since the change of refraction differs for 
different age groups. 

Based on [18] and some preliminary analysis we carried out on this data, we choose 
seven risk factors: sex, inc, jomyop, catct, pky, asa and vtm (sex, income, juvenile myopia, 
nuclear cataract, packyear, aspirin and vitamins). Descriptions and binary cut points are 
presented in Table [3J For most of these variables, we know which direction is bad. For 
example, male gender is a risk factor for most diseases and smoking is never good. The 
binary cut points are somewhat subjective here. Regarding pky, a pack a day for 30 years, 
for example, is a fairly substantial smoking history, catct has five levels of severity and we cut 
it at the third level. Aspirin (asa) and vitamin supplements (vtm) are commonly taken to 
maintain good health so we treat not taking them as risk factors. Juvenile myopia jomyop is 
assessed from self-reported age at which the person first started wearing glasses for distance. 
For the purposes of this study we have defined myopic change as a change in refraction 
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of more than -0.75 diopters from baseline exam to the five year followup; accordingly y is 
assigned 1 if this change occurred and otherwise. There are 1374 participants in this age 
group at the baseline examination, of which 952 have measurements of refraction at the 
baseline and the five-year follow-up. Among the 952 people, 76 have missing values in the 
covariates. We assume that the missing values are missing at random for both response and 
covariates, although this assumption is not necessarily valid. However the examination of 
the missingness and possible imputation are beyond the scope of this study. Our final data 
consists of 876 subjects without any missing values in the seven risk factors. 

Table 3: The variables in the myopic change example. The fourth column shows which 
direction is risky. 



code 


variable 


unit 


higher risk 


sex 


sex 




Male 


inc 


income 


$1000 


< 30 


jomyop 


juvenile myopia 


age first wore glasses for distance 


yes before age 21 


catct 


nuclear cataract 


severity 1-5 


4-5 


pky 


packyear 


pack per day x years smoked 


>30 


asa 


aspirin 


taking/not taking 


not taking 


vtm 


vitamins 


taking/not taking 


not taking 



As the data set is small, we consider all possible patterns (q — 7). The first step of the 
LASSO-Patternsearch algorithm selected 8 patterns, given in Table HI 



Table 4: Eight patterns selected at Step 1 in the myopic change data 





Pattern 


Estimate 




Pattern 




Estimate 


1 


constant 


-2.9020 


6 


pky x vtm 




0.6727 


2 


catct 


1.9405 


7 


inc x pky x 


vtm 


0.0801 


3 


asa 


0.3000 


8 


sex x inc x 


jomyop x asa 


0.8708 


4 


inc x pky 


0.2728 


9 


sex x inc x 


catct x asa 


0.2585 


5 


catct x asa 


0.3958 
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Figure 3: The eight patterns that survived Step 1 of LPS. The vertical bars are 90% 
confidence intervals based on linear logistic regression. Red dots mark the patterns that are 
significant at the 90% level. The orange dots are borderline cases, the confidence intervals 
barely covering 0. 
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Figure [3] plots the coefficients of the 8 patterns plus the constant that survived Step 1 
along with 90% confidence intervals. These patterns are then subject to Step 2, backward 
elimination, tuned via BGACV. The final model after the backward elimination step is 

/ = -2.84 + 2.42 x catct + 1.11 x pky x vtm 

+ 1.98 x sex x inc x jomyop x asa + 1.15 x sex x inc x catct x asa. (11) 

The significance levels for the coefficients of the four patterns in this model ffTTT) can be 
formally computed and are, respectively 3.3340e-21, 1.7253e-05, 1.5721e-04, and 0.0428. 
The pattern pky x vtm catches our attention because the pattern effect is strong and both 
variables are controllable. This model tells us that the distribution of y, myopic change 
conditional on pky = 1 depends on catct, as well as vtm and higher order interactions, but 
myopic change conditional on pky = is independent of vtm. This interesting effect can 
easily be seen by going back to a table of the original catct, pky and vtm data (Table [5]). 
The denominators in the risk column are the number of persons with the given pattern and 
the numerators are the number of those with y — 1. The first two rows list the heavy 
smokers with cataract. Heavy smokers who take vitamins have a smaller risk of having 
myopic change. The third and fourth rows list the heavy smokers without cataract. Again, 
taking vitamins is protective. The first four rows suggest that taking vitamins in heavy 
smokers is associated with a reduced risk of getting more myopic. The last four rows list 
all non-heavy smokers. Apparently taking vitamins does not similarly reduce the risk of 
becoming more myopic in this population. Actually, it is commonly known that smoking 
significantly decreases the serum and tissue vitamin level, especially Vitamin C and Vitamin 
E, for example [8]. Our data suggest a possible reduction in myopic change in persons who 
smoke who take vitamins. However, our data are observational and subject to uncontrolled 
confounding. A randomized controlled clinical trial would provide the best evidence of any 
effect of vitamins on myopic change in smokers. 

Since the model is the result of previous data mining, caution in making significance 
statements may be in order. To investigate the probability of the overall procedure to 
generate significant false patterns, we kept the attribute data fixed, randomly scrambled the 
response data and applied the LPS algorithm on the scrambled data. The procedure was 
repeated 1000 times, and in all these runs, 1 main effect, 10 second order, 5 third order 
and just 1 fourth order patterns showed up. We then checked on the raw data. There are 
21 people with the pattern sex x inc x jomyop x asa and 9 of them have myopic change. 
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Table 5: The raw data for cataract, smoking and not taking vitamins. 



catct 


pky 


no vitamins 


risk 


1 


1 


1 


17/23 = 0.7391 


1 


1 





7/14 = 0.5000 





1 


1 


22/137 = 0.1606 





1 





2/49 = 0.0408 


1 





1 


18/51 = 0.3529 


1 








19/36 = 0.5278 








1 


22/363 = 0.0606 











13/203 = 0.0640 



The incidence rate is 0.4286, as compared to the overall rate of 0.137. Note that none of the 
variables in this pattern are involved with variables in the two lower order patterns catct and 
pky x vtm so it can be concluded that the incidence rate is contributed only by the pattern 
effect. People with the other size four pattern sex x inc x catct x asa have an incidence rate 
of 0.7727 (17 out of 22), which can be compared with the incidence rate of all people with 
catct, 0.4919. 

We also applied Logic regression and SPLR on this data set. Logic selected both catct 
and pky x vtm but missed the two high order patterns. Instead, it selected asa as a main 
effect. Note that asa is present in both size four patterns. SPLR selected the same patterns 
as Logic regression with an addition of pky, which is necessary for pky x vtm to be included. 
These results agree with what we have found in the simulation studies: they are not as likely 
as LPS in finding higher order patterns. It is noted that in the original version of LPS |31j . 
Step 1 was tuned by GACV rather than BGACV, and resulted in the above eight patterns 
in Table H] plus four more, but the final model after Step 2 was the same. 

6 Rheumatoid Arthritis and SNPs in a Generative Model 
Based on GAW 15 

Rheumatoid arthritis (RA) is a complex disease with a moderately strong genetic compo- 
nent. Generally females are at a higher risk than males. Many studies have implicated a 
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specific region on chromosome 6 as being related to the risk of RA, recently [TJ, although 
possible regions on other chromosomes have also been implicated. The 15th Genetic Analysis 
Workshop (GAW 15, November 2006 [6]) focused on RA, and an extensive simulation data 
set of cases and controls with simulated single nucleotide polymorphisms (SNPs) was pro- 
vided to participants and is now publicly available [21]. SNPs are DNA sequence variations 
that occur when a single nucleotide in the genome sequence is changed. Many diseases are 
thought to be associated with SNP changes at multiple sites that may interact, thus it is 
important to have tools that can ferret out groups of possibly interacting SNPs. 

We applied LPS to some of the simulated SNP RA GAW 15 data [30]. This provided an 
opportunity to apply LPS in a context with large genetic attribute vectors, with a known 
genetic architecture, as described in [21], and to compare the results against the description of 
the architecture generating the data. We decided to use the GAW data to build a simulation 
study where we modified the model that appears in [30] to introduce a third order pattern, 
and in the process deal with some anomalous minus signs in our fitted model, also observed 
by others [2H] • We then simulated phenotype data from the GAW genotypes and covariates, 
and can evaluate how well the LPS of this paper reproduces the model generating the data 
with the third order pattern. This section describes the results. 

In the simulated genetic data sets [24| genome wide scans of 9187 SNPs were generated 
with just a few of the SNPs linked to rheumatoid arthritis, according to the described model 
architecture. The data simulation was set up to mimic the familial pattern of rheumatoid 
arthritis including a strong chromosome 6 effect. A large population of nuclear families (two 
parents and two offspring) was generated. This population contains close to 2 million sibling 
pairs. From this population, a random sample of 1500 families was selected from among 
families with two affected offspring and another random sample of 2000 families was selected 
from among families where no member was affected. A total of 100 independent (replicate) 
data sets were generated. We randomly picked one offspring from each family in replicate 1 
as our data set. As for covariates we take the 674 SNPs in chromosome 6 that were generated 
as a subset of the genome wide scan data and three environmental variables: age, sex and 
smoking. We created two dummy variables for each SNP since most of them have three 
levels: normal, one variant allele and two variant alleles. For environmental variables female 
gender is treated as a risk factor, smoking is treated as a risk factor and age > 55 is treated 
as a risk factor. We first describe a reanalysis of this data using the LPS algorithm of this 
paper. We began our analysis with a screen step. In this step, each variable is entered 
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into a linear logistic regression model. For SNPs with three levels, both dummy variables 
are entered into the same model. We fit these models and keep the variables with at least 
one p-value less than 0.05. The screen step selected 74 variables (72 SNPs plus sex and 
smoking). We then ran LPS on these 74 variables with q = 2, which generates 10371 basis 
functions. The final model was 

/ = -.62 + .87 x smoking + 1.05 x sex - 2.04 x SiVP6_153_l 

-1.45 x SiVP6_154_l + 2.23 x SiVP6_162_l - 5.60 x SiVP6_153_2, (12) 

where S7VP6_153_1 is SNP number 153 on chromosome 6 with 1 variant allele, SWP6_154_1 
is SNP number 154 on chromosome 6 with 1 variant allele, SWP6_162_1 is SNP number 162 
on chromosome 6 with 1 variant allele and SiVP6_153_2 is SNP number 153 on chromosome 
6 with 2 variant alleles. When the analysis in [30] was presented at the GAW 15 workshop 
in November 2006 the tuning procedure presented here had not been finalized, and both 
Step 1 and Step 2 were tuned against prediction accuracy using replicate 2 as a tuning set. 
The model (|12p obtained here is exactly the same as [3D] . We were pleased to find that 
the in-sample BGACV tuning here was just as good as having a separate tuning set. It 
is interesting to note that in this particular problem, tuning for prediction and for model 
selection apparently led to the same results, although in general this is not necessarily the 
case. 

According to the description [21] of the architecture generating the data the risk of RA 
is affected by two loci (C and D) on chromosome 6, sex, smoking and a sex by locus C 
interaction. It turns out that both SNP6J.53 and S'iVP6_154 are very close to locus C 
on chromosome 6 and SN P6J.62 is very close to locus D on chromosome 6. The LPS 
method picked all important variables without any false positives. The description of the 
data generation architecture said that there was a strong interaction between sex and locus 
C. We didn't pick up a sex by locus C interaction in [30], which was surprising. 

We were curious about the apparent counter-intuitive negative coefficients for both SWP6_153 
patterns and the SWP6_154_1 pattern, which appear to say that normal alleles are risky and 
variant alleles are protective. Others also found an anomalous protective effect for 5'iVP6_154 
normal alleles [28]. We went back and looked at the raw data for 5'iVP6_153 and SNP6A5A 
as a check but actually Table 4 of [30] shows that this protective effect is in the simulated 
data, for whatever reason, and it also shows that this effect is stronger for women than for 
men. We then recoded the S"iVP6_153 and SWP6J.54 responses to reflect the actual effect 
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of these two variables as can be seen in tables of the simulated data. 



Table 6: Fitted and Simulated Models, see text for explanation. 





Variable 1 


Level 1 


Variable 2 


Level 2 


Coef 


Est 




constant 


- 


- 


- 


-4.8546 


-4.6002 




smoking 


- 


- 


- 


0.8603 


0.9901 


Main effects 


SNPQA53 


1 


- 


- 


1.8911 


1.5604 




SNP6_162 


1 






2.2013 


1.9965 




SNP6_154 


2 






0.7700 


1.0808 




sex 




SNP6A53 


1 


0.7848 


0.9984 




sex 




SNPQ_15A 


2 


0.9330 


0.9464 


Second order 


SNP6_153 


2 


SNPQ_15A 


2 


4.5877 


4.2465 


patterns 


SNP6_153 


1 


SNP6_553 


2 


0.4021 







S7VP6_154 


2 


SNP6A90 


1 


0.3888 





Added 


Third order pattern 


sex x SNPQ_W8 


;_2 x £WP6_334_2 


3 


2.9106 



Table 6 shows the results. The new fitted model, above the double line, has four main 
effects and five second order patterns. The estimated coefficients are given in the column 
headed "Coef. The sex x SNP6A53 and sex x SNP6A54 are there as expected. Two weak 
second order patterns involving S7VP6_553 and 5'iVP6_490 are fitted, but do not appear to 
be explained by the simulation architecture. This model resulted from fitting with q = 2. 
Then the LPS algorithm was run to include all third order patterns (q = 3) of the 74 
variables which passed the screen step. This generated 403,594 basis functions. No third 
order patterns were found, and the fitted model was the same as in the q = 2 case. To see if 
a third order pattern would be found if it were there, we created a generative model with the 
four main effects and five second order patterns of Table 1, with their coefficients from the 
"Coef column, and added to it a third order pattern sex x S7VP6_108_2 x S'iVP6_334_2 with 
coefficient 3. The two SNPs in the third order pattern were chosen to be well separated in 
chromosome 6 from the reported gene loci. LPS did indeed find the third order pattern. The 
estimated coefficients are found under the column headed "Est". No noise patterns were 
found, and the two weak second order patterns in the model were missed. However, the 
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potential for the LPS algorithm to find higher order patterns is clear. Further investigation 
of the properties of the method in genotype-phenotype scenarios is clearly warranted, taking 
advantage of the power of the LASSO algorithm to handle a truly large number of unknowns 
simultaneously. Run time was 4.5 minutes on an AMD Dual-Core 2.8 GHz machine with 64 
GB memory. Using multiple runs with clever designs to guarantee that every higher order 
pattern considered is in at least one run, will allow the analysis of much larger SNP data 
sets with tolerable computer cost. 

7 Discussion 

In any problem where there are a large number of highly interacting predictor variables that 
are or can be reduced to dichotomous values, LPS can be profitably used. If the "risky" 
direction (with respect to the outcome of interest) is known for all or almost all of the 
variables, the results are readily interpretable. If the risky direction is coded correctly for 
all of the variables, the fitted model can be expected to be sparser than that for any other 
coding. However, if a small number of risky variables are coded in the "wrong" way, this 
usually can be detected. The method can be used as a preprocessor when there are very 
many continuous variables in contention, to reduce the number of variables for more detailed 
nonparametric analysis. 

LPS, using the algorithm of Appendix [B] is efficient. On an Intel Quad-core 2.66 GHz 
machine with 8GB memory the LPS Steps 1 and 2 can do 90,000 basis functions in 4.5 
minutes. On an AMD Dual-Core 2.8 GHz machine with 64 GB memory the algorithm did 
LPS with 403,594 basis functions in 4.5 minutes. It can do 2,000,000 basis functions in 1.25 
hours. On the same AMD machine, problems of the size of the myopia data (128 unknowns) 
can be solved in a few seconds and the GAW 15 problem data (10371 unknowns) was solved 
in 1.5 minutes. 

A number of considerations enter into the choice of q. If the problem is small and the 
user is interested in high order patterns, it doesn't hurt to include all possible patterns; if 
the problem is about the size of Simulation Example 3, q = 4 might be a good choice; For 
genomic data the choice of q can be limited by extremely large attribute vectors. In genomic 
or other data where the existence of a very small number of important higher patterns is 
suspected, but there are too many candidates to deal with simultaneously, it may be possible 
to overcome the curse of dimensionality with multiple screening levels and multiple runs. For 
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example, considering say, third or even fourth order patterns, variables could be assigned 
to doable sized runs so that every candidate triple or quadruple of variables is assigned 
to at least one run. With our purpose built algorithm, the approach is quite amenable 
to various flavors of exploratory data mining. When a computing system such as Condor 



(http://www.cs.wisc.edu/condor/) is available, many runs can compute simultaneously. 

Many generalizations are available. Two classes of models where the LASSO-Patternsearch 
approach can be expected to be useful are the multicategory end points model in [22] [33] . 
where an estimate is desired of the probability of being in class k when there are K possible 
outcomes; another is the multiple correlated endpoints model in [9]. In this latter model, 
the correlation structure of the multiple endpoints can be of interest. Another generalization 
allows the coefficients q to depend on other variables; however, the penalty functional must 
involve l\ penalties if it is desired to have a convex optimization problem with good sparsity 
properties with respect to the patterns. In studies with environmental as well as genomic 
data selected interactions between SNP patterns and continuous covariates can be examined 
[38] : the numerical algorithm can be used on large collections of basis functions that induce 
a reasonable design matrix, for example collections including splines, wavelets or radial basis 
functions. 



8 Summary and Conclusions 

The LASSO-Patternsearch algorithm brings together several known ideas in a novel way, us- 
ing a tailored tuning and pattern selection procedure and a new purpose built computational 
algorithm. We have examined the properties of the LPS by analysis of observational data, 
and simulation studies at a scale similar to the observational data. The results are verified 
in the simulation studies by examination of the generated "truth" , and in the observational 
data by selective examination of the observational data directly, and data scrambling to 
check false alarm rates, with excellent results. The novel computational algorithm allows 
the examination of a very large number of patterns, and, hence, high order interactions. 
We believe the LASSO-Patternsearch will be an important addition to the toolkit of the 
statistical data analyst. 
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Appendices 



A The BGACV Score 



We denote the estimated logit function by /a(-) and define fxi = f\(x(i)), p\i 
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for i — 1, • • • , n. Now define 

1 ^ 
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n i=i 
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(14) 



Vi - Pxi 

Here o\ { = a 2 (fxi) and the approximation in (fill) follows upon recalling that |j = a 2 . 

Denote the objective function in ^ - (jlj) by I\(y, c), let By = Bj(x(i)) be the entries of 
the design matrix B, and for ease of notation denote \x = cn b - Then the objective function 
can be written 

1 n Nb r^B o x Nb ~ 1 

h(y,c)^-j:[-y i j:c j B ij + log(l + e^ c ^)}+X £ \cj 
i=i j=i j=i 



(15) 
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Denote the minimizer of (TT5|) by c\. We know that the l\ penalty produces sparse solutions. 
Without loss of generality, we assume that the first s components of c\ are nonzero. When 
there is a small perturbation e on the response, we denote the minimizer of I\(c,y + e) by 
c\. The O's in the solutions are robust against a small perturbation in the response. That 
is, when e is small enough, the elements will stay at 0. This can be seen by looking at the 
KKT conditions when minimizing (fl5|) . Therefore, the first s components of c\ are nonzero 
and the rest are zero. For simplicity, we denote the first s components of c by c* and the 
first s columns of the design matrix B by B*. Then let f x be the column vector with % entry 
fx(x(i)) based on data y, and let f x +e be the same column vector based on data y + e. 



fl +e -fi = B{ci-c x ) = B\ct-ci). 



(16) 



Now we take the first-order Taylor expansion of 



\dh 






+ 






dc* 




dc* 


dc*dc*' 





(of - <*)) + 



d 2 h 
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])B* = B*'WB*, say, 



V = -n 
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dc*dy' 

By the first-order conditions, the left-hand side and the first term of the right-hand side of 
( TTT1) are zero. So we have 

U(c x *-c x ))*Ve. (18) 
Combine (fT5"j) and (|18|) we have f x — f x ~ He, where 



H = B*U^V = B*U~ l B 



*TT— 1 D* ; 



(19) 



Now let e be e = (0, • • • , ^ - p [ Xi % \ • • • , 0)'; then /f" £ - fl « i^e*, where e< = y 4 - p Ai l] and 
i/j is the ith column of if. By the Leave-Out-One Lemma (stated below), /a = f\ y+e - 
Therefore 
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where ha is the iith entry of H. From the right hand side of (THj) . the approximate CV score 
is 

CV(X) « I + log(l + e'*)] + \ ± hu *tey>u) (21) 

The GACV score is obtained from the approximate CV score in ()2ip by replacing ^ by 
Hr(H) and by ±tr(Wif). It is not hard to see that tr(WH) = trW 1/2 HW 1/2 = s = 

Nb , the number of basis functions in the model, giving 

GACV(X) = - Y\-Vifxi + log(l + e'")l + -trff g^%^ . (22) 
n f^i n (n- N Bo ) 

Adding the weight | log n to the "optimism" part of the GACV score, we obtain the B-type 
GACV (BGACV): 

BGACV(X) = - ±[-yJ M + log(l + e?»)] + -^trH (23) 
n i=1 n l {n — l\B ) 

Lemma A.l (Leave- Out-One Lemma) 

Let the objective function I\(y, f) be defined as before. Let be the minimizer of I\(y, f) 
with the ith observation omitted and let p\ be the corresponding probability. For any real 
number v, we define the vector z = (yi,---, y%-i, v, j/j+i, • ■ • , y n )' ■ Let h\(i, v, ■) be the mini- 
mizer ofh{z, f); then h x (i,p [ ^ l \ ■) = f[~ A {-). 

The proof of lemma IA~T1 is quite simple and very similar to the proof of the Leave-Out-One- 
Lemma in [38] so we will omit it here. 

We remark that in this paper we have employed the BGACV criterion twice as a stringent 
model selector under the assumption that the true or the desired model is sparse. Simulation 
experiments (not shown) suggest that the GACV criterion is preferable if the true model 
is not sparse and/or the signal is weak. The GACV and the BGACV selections probably 
bracket the region of interest of A in most applications. 

Comments 

A referee has asked how BGACV might be compared to the more familiar BIC = — log likelihood+ 
^fpd/. where df is the degrees of freedom in the case of Bernoulli data. The short answer 
to this question is that an exact expression for df does not, in the usual sense, exist in the 
case of Bernoulli data. Thus, only a hopefully good approximation to something that plays 
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the role of df in the Bernoulli case can be found. This argument, which is independent of 
the nature of the estimate fx of /, is found in Section 2 of [23J. We sketch the main idea. 
Let KL(X) = KL(f, fx) be the Kullback-Liebler distance between the distribution with the 
true but unknown canonical link / and the distribution with link fx and let 

i n n 

CKL(X) = KL(f, fx) - -E Etftfi - &(/,)] = E + Khi) (24) 

be the comparative Kullback-Liebler distance. The goal is to find an unbiased estimate of 
CKL(\) as a function of A, which will then be minimized to estimate the A minimizing the 
true but unknown CKL. Letting OBS(X) = - J27=i ~Vifxi + Hfte) we can write CKL(X) = 
OBS(X) + D(X). where D(X) = ± EtiiVi ~ t*i)fxi- Then E,D(X) = I ELi MVi ~ 
Ye and Wong (1997b) show, in exponential families, for any estimate fx of / 

"1 71 1 K\ €S 

- E ^(Vi - IH)fM = - E °i «-^(/Ai)- (25) 

' 4 i=l i=l 

Here E^.(fxi) is the expectation with respect to conditional on the yj,j ^ i being fixed. 
(Their proof is reproduced in [23J.) Ye and Wong call n times the right hand side of (1251) 
the generalized degrees of freedom (GDF). and it does indeed reduce to the usual trace 
of the influence matrix in the case of Gaussian data with quadratic penalties. Unbiased 
estimates of the GDF can be found for Poisson, Gamma, Binomial distribution taking on 
three or more values, and other distributions, using the results in [13] but Ye and Wong 
show that no unbiased estimate of the GDF in the Bernoulli case exists. Thus, in the 
absence of a bona fide unbiased risk method of estimating the CKL(X) the alternative 
GACV based on leave-one-out to target the CKL has been proposed. Thus, in this paper 
^trif ^'^jVg = E(X), say, plays the role of df. Since no exact unbiased estimate for df 
exists, the issue of the accuracy of the approximations in obtaining D reduces to the issue 
of to what extent the minimizer of GACV(X) is a good estimate of the minimizer of the 
(unobservable) CKL(X). The GACV for Bernoulli data was first proposed in [3B] for RKHS 
(quadratic) penalty functionals, where simulation results demonstrated the accuracy of this 
approximation. Further excellent favorable results for a randomized version of the GACV 
with RKHS penalties were presented in [23]. In [38] the GACV was derived for Bernoulli data 
with l\ penalties with a nontrivial null space, and favorable results for the randomized version 
were obtained. The derivation was rather complicated, and a simplified derivation as well 
as a simpler expression for the result which is possible in the present context are presented 
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above. A recent work involving the LASSO in the Bernoulli case with li penalty uses a tuning 
set to choose the smoothing parameters. SPLR uses tr(B *' WB * +\I*)~ 1 (B *' WB*), 
where / is the diagonal matrix with all l's except in the position of the model constant, as 
their proxy for df in their BIC-like criteria for model selection after fixing A. In the light of 
the Ye and Wong result it is no surprise that an exact definition of df in this case cannot be 
found in the literature. 

B Minimizing the Penalized Log Likelihood Function 

The function (TjQ) is not differentiable with respect to the coefficients {q} in the expansion 
(jl]), so most software for large-scale continuous optimization cannot be used to minimize it 
directly. We can however design a specialized algorithm that uses gradient information for 
the smooth term £(y, f) to form an estimate of the correct active set (that is, the set of 
components q that are zero at the minimizer of (pQ)). Some iterations of the algorithm also 
attempt a Newton-like enhancement to the search direction, computed using the projection 
of the Hessian of £ onto the set of nonzero components q. This approach is similar to the 
two-metric gradient projection approach for bound-constrained minimization, but avoids 
duplication of variables and allows certain other economies in the implementation. 

We give details of our approach by simplifying the notation and expressing the problem 
as follows: 



When T is convex (as in our application), z is optimal for (126]) if and only if the following 
condition holds: 




(26) 




(27) 



for some vector v in the subdifferential of ||z||i (denoted by <9||z||i), that is, 





A measure of near-optimality is given as follows: 



5(z) 



''fc^lk-lli 



min ||VT(z) + Xv 



(29) 
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We have that 5(z) =0 if and only if z is optimal. 

In the remainder of this section, we describe a simplified version of the algorithm used 
to solve (I26p . finishing with an outline of the enhancements that were used to decrease its 
run time. 

The basic (first-order) step at iteration k is obtained by forming a simple model of the 
objective by expanding around the current iterate z k as follows: 

d k = argmin T{z k ) + VT{z k ) T d + -a k d T d + X\\z k + d||i, (30) 

d 2 

where a k is a positive scalar (whose value is discussed below) and d k is the proposed step. 
The subproblem (|30|) is separable in the components of d and therefore trivial to solve in 
closed form, in 0(m) operations. We can examine the solution d k to obtain an estimate of 
the active set as follows: 

A k = {i = l,2,...,m \ (z k + d k ) i = 0}. (31) 

We define the "inactive set" estimate I k to be the complement of the active set estimate, 
that is, 

J fc = {1,2, . . . ,m} \ A k . 

If the step d k computed from (130]) does not yield a decrease in the objective function T\, 
we can increase a k and re-solve (130]) to obtain a new d k . This process can be repeated as 
needed. It can be shown that, provided z k does not satisfy an optimality condition, the d k 
obtained from (130]) will yield T\(z k + d k ) < T\(z k ) for a k sufficiently large. 

We enhance the step by computing the restriction of the Hessian V 2 T(z k ) to the set I k 
(denoted by Vz kXk T(z k )) and then computing a Newton-like step in the X k components as 
follows: 

(V 2 IkI T(z k ) + 5 k I)p k k = -V Ik T(z k ) - Xw Ik , (32) 

where 5 k is a small damping parameter that that goes to zero as z k approaches the solution, 
and Wj k captures the gradient of the term ||z||i at the nonzero components of z k + d k . 
Specifically, w% h coincides with d\\z k + <i fc ||i on the components % G X k . If 5 k were set to zero, 
Pj k would be the (exact) Newton step for the subspace defined by T k ; the use of a damping 
parameter ensures that the step is well defined even when the partial Hessian Vj fcIfc T(2; fc ) is 
singular or nearly singular, as happens with our problems. In our implementation, we choose 

5k = min (5(z k ), mean diagonal of V| feXfc T(2 fc )^ , (33) 
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where S(z) is defined in (1291) . 

Because of the special form of T(z) in our case (it is the function C defined by §2$) and 
(j3J)), the Hessian is not expensive to compute once the gradient is known. However, it is 
dense in general, so considerable savings can be made by evaluating and factoring this matrix 
on only a reduced subset of the variables, as we do in the scheme described above. 

If the partial Newton step calculated above fails to produce a decrease in the objective 
function T\, we reduce its length by a factor 7^, to the point where z\ + 'jkPi has the same 
sign as z\ for all i G Ik- If this modified step also fails to decrease the objective T\, we try 
the first-order step calculated from (I30l) . and take this step if it decreases T\. Otherwise, we 
increase the parameter a^, leave z k unchanged, and proceed to the next iteration. 

We summarize the algorithm as follows. 

Algorithm B.l 

given initial point z° , initial damping qq > 0, constants tol > and t] e (0, 1); 
for k = 0,1,2,... 

if S(z k ) < tol 

stop with approximate solution z k ; 

end 

Solve for d k ; (* first- order step *) 
Evaluate Au and X^; 

Compute pj k from (* reduced Newton step *) 

Set z} k = z\ + p\ k and z\ k = 0; 

if T\(z + ) < mm(Tx(z k + d k ),T\(z k )) (* Newton step successful *) 
z k+i ^ z +. 

else 

Choose 7fc as the largest positive number such that (z k + ^kP k )iZ k > 

for all i with z k 7^ 0; (* damp the Newton step *) 
Set z^ k = z\ h + 7 fc p| fe and z\ k = 0; 

if T\(z + ) < min(T\(z k + d k ),Tx(z k )) (* damped Newton step successful *) 
z k+i ^ z +. 

else if T\(z k + d k ) < T\(z k ) (* first- order step successful; use it if Newton steps have 
failed *) 

z k+i ^_ z k _|_ ^k. 
else (* unable to find a successful step *) 
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end 
end 

(* increase or decrease a depending on success of first- order step *) 
if T x (z k + d k ) < T x (z k ) 

ctk+i <— i] a k; (* first- order step decreased T\, so decrease a *) 
else 

a k+1 <— a k /rj; 
end 
end 

We conclude by discussing some enhancements to this basic approach that can result in 
significant improvements to the execution time. Note first that evaluation of the full gradient 
VT(z k ), which is needed to compute the first-order step (130]) can be quite expensive. Since 
in most cases the vast majority of components of z k are zero, and will remain so after the 
next step is taken, we can economize by selecting just a subset of components of VT(z k ) 
to evaluate at each step, and allowing just these components of the first-order step d to be 
nonzero. Specifically, for some chosen constant a G (0,1], we select am components from 
the index set {1, 2, ... , m} at random (using a different random selection at each iteration), 
and define the working set V\4 to be the union of this set with the set of indices i for which 
z k 7^ 0. We then evaluate just the components of VT(z fc ) for the indices % G V\4, and solve 
(I30p subject to the constraint that di = for i ^ V\4- 

Since 5(z k ) cannot be calculated without knowledge of the full gradient VT(z k ), we define 
a modified version of this quantity by taking the norm in (|2"§]) over the vector defined by 
Wfc, and use this version to compute the damping parameter 5 k in ( I331) . 

We modify the convergence criterion by forcing the full gradient vector to be computed 
on the next iteration k + 1 when the threshold condition 8(z ) < tol is satisfied. If this 
condition is satisfied again at iteration k + 1, we declare success and terminate. 

A further enhancement is that we compute the second-order enhancement only when the 
number of components in I k is small enough to make computation and factorization of the 
reduced Hessian economical. In the experiments reported here, we compute only the first 
order step if the number of components in 1 k exceeds 500. 
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C Results of Simulation Example 3 



Table 7: Results of Simulation Example 3. In each row of a cell the first three numbers are 
the appearance frequencies of the important patterns and the last number is the appearance 
frequency of noise patterns. 










0.2 


0.5 


0.7 





LPS 
Logic 
SPLR 


96/100/100/54 
100/98/96/120 
100/100/100/527 


98/100/100/46 
98/95/93/107 
100/100/98/525 


100/100/100/43 

99/94/92/83 
100/100/98/487 


100/100/100/44 
100/98/83/134 
100/100/97/489 


0.2 


LPS 
Logic 
SPLR 


99/100/100/46 

99/97/94/96 
100/100/94/517 


100/100/100/49 
100/99/87/94 
100/99/96/530 


100/100/100/39 
100/100/88/73 
100/97/95/495 


100/100/98/36 
100/99/86/117 
100/100/96/485 


0.5 


LPS 
Logic 
SPLR 


99/100/99/47 
99/96/86/162 
100/98/75/548 


99/100/100/51 
100/95/87/109 
100/96/80/552 


100/100/99/51 
100/96/78/122 
100/99/80/531 


100/100/98/46 
100/99/80/143 
100/98/78/518 


0.7 


LPS 
Logic 
SPLR 


100/99/96/44 
100/83/70/195 
100/91/51/580 


99/99/97/51 
100/88/69/167 
100/85/49/594 


100/99/96/67 
100/85/70/153 
100/81/52/584 


100/99/94/65 

100/89/74/126 

100/72/55/570 



D Effect of Coding Flips 

Proposition: Let f(x) = ji + c jij 2 -j r ^jij2--jr( x ) with all Cj 1 j 2 „j r which appear in the sum 
strictly positive. If Xj — > 1 — Xj for j £ some subset of {1, 2, p} such that at least one re- 
appears in /, then the resulting representation has at least one negative coefficient and at 
least as many terms as /. This follows from the 

Lemma: Let gkix) be the function obtained from / by transforming Xj — > 1 — Xj, 1 < 
j < k. Then the coefficient of Bj 1 j 2 j r (x) in gk{x) is 

(_l)|{iiv.,>}n{i,...,fc}| V- Ct 

0'l,-,>}CTC{ii,-,i r }U{l,-fe,} 
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where | • | means number of entries. 
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